Abstract



We study the tracking problem, namely, estimating the hidden state of an object 
over time, from unreliable and noisy measurements. The standard framework for 
the tracking problem is the generative framework, which is the basis of solutions 
such as the Bayesian algorithm and its approximation, the particle filter. However,
the problem with these solutions is that they are very sensitive to model
mismatches. 

In this paper, motivated by online learning, we introduce a new framework - an 
explanatory framework - for tracking. We provide an efficient tracking algorithm 
for this framework. We provide experimental results comparing our algorithm 
to the Bayesian algorithm on simulated data. Our experiments show that when 
there are slight model mismatches, our algorithm vastly outperforms the Bayesian 
algorithm. 



1 Introduction 

We study the tracking problem, which has numerous applications in AI, control and finance. In 
tracking, we are given noisy measurements over time, and the problem is to estimate the hidden 
state of an object. The challenge is to do this reliably, by combining measurements from multiple 
time steps and prior knowledge about the state dynamics, and the goal of tracking is to produce 
estimates that are as close to the true states as possible. 

The most popular solutions to the tracking problem are the Kalman filter [1], the particle filter [2],
and their numerous extensions and variations (e.g. [3, 4]), which are based on a generative framework
for the tracking problem. Suppose we want to track the state $x_t$ of an object at time $t$, given
only measurement vectors $M(\cdot, t')$ for times $t' \le t$. In the generative approach, we think of the
state $X(t)$ and measurements $M(\cdot, t)$ as random variables. We represent our knowledge regarding
the dynamics of the states using the transition process $\Pr(X(t) \mid X(t-1))$ and our knowledge regarding
the (noisy) relationship between the states and the observations by the measurement process
$\Pr(M(\cdot, t) \mid X(t))$. Then, given only the observations, the goal of tracking is to estimate the hidden
state sequence $(x_1, x_2, \ldots)$. This is done by calculating the likelihood of each state sequence
and then using as the estimate either the sequence with the highest posterior probability (maximum
a posteriori, or MAP) or the expected value of the state with respect to the posterior distribution
(the Bayesian algorithm). In practice, one uses particle filters, which are an approximation to the
Bayesian algorithm.
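
As an illustration of the recursion behind the Bayesian algorithm, the following is a minimal sketch of one predict-and-correct step on a finite state grid. The three-state grid, the transition matrix and the likelihood values are made-up numbers chosen for this example only; they are not taken from the experiments in this paper.

```python
import numpy as np

def bayes_filter_step(prior, transition, likelihood):
    """One step of a discrete Bayes filter on a finite state grid.

    prior:      shape (S,), posterior over states at time t-1
    transition: shape (S, S), transition[i, j] = Pr(X(t) = j | X(t-1) = i)
    likelihood: shape (S,), Pr(M(., t) | X(t) = j) for the observed measurement
    """
    predicted = prior @ transition        # predict: push the belief through the dynamics
    posterior = predicted * likelihood    # correct: reweight by the measurement likelihood
    return posterior / posterior.sum()    # normalise to a probability vector

# Hypothetical toy problem with three states.
states = np.array([0.0, 1.0, 2.0])
prior = np.array([1 / 3, 1 / 3, 1 / 3])
transition = np.array([[0.8, 0.2, 0.0],
                       [0.1, 0.8, 0.1],
                       [0.0, 0.2, 0.8]])
likelihood = np.array([0.1, 0.3, 0.6])

posterior = bayes_filter_step(prior, transition, likelihood)
estimate = float(posterior @ states)      # posterior mean, the "Bayesian algorithm" estimate
map_state = states[np.argmax(posterior)]  # MAP alternative for a single time step
```

The posterior mean and the MAP state correspond to the two estimates mentioned above, here for a single time step rather than a whole sequence.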

The problem with the generative framework is that in practice, it is very difficult to precisely
determine the distributions of the measurements. Moreover, the Bayesian algorithm is very sensitive
to model mismatches, so using a model which is slightly different from the model generating the 
measurements can lead to a large divergence between the estimated states and the true states. 

To address this, we introduce an online-learning-based framework for tracking. In our framework,
called the explanatory framework, we are given a set of state sequences or paths in the state space;
but instead of assuming that the observations are generated by a measurement model from a path
in this set, we think of each path as a mechanism for explaining the observations. We emphasize
that this is done regardless of how the observations are generated. Suppose a path $(x_1, x_2, \ldots)$ is
proposed as an explanation of the observations $(M(\cdot, 1), M(\cdot, 2), \ldots)$. We measure the quality of
this explanatory path using a predefined loss function, which depends only on the measurements 
(and not on the hidden true state). The tracking algorithm selects its own explanatory path by taking 
a weighted average of the best explanatory paths according to the past observations. The theoretical
guarantee we provide is that the loss of the explanatory path generated in this online way by the
tracking algorithm is close to that of the explanatory path with the minimum such loss; here, the
loss is measured according to the loss function supplied to the algorithm. Such guarantees are
analogous to competitive analysis used in online learning [5, 6, 7], and it is important to note that
such guarantees hold uniformly for any sequence of observations, regardless of any probabilistic 
assumptions. 

Our next contribution is to provide an online-learning-based algorithm for tracking in the explanatory
framework. Our algorithm is based on NormalHedge [8], which is a general online learning
algorithm. NormalHedge can be instantiated with any loss function. When supplied with a bounded
loss function, it is guaranteed to produce a path with loss close to that of the path with the minimum
loss, from a set of candidate paths. As it is inefficient to directly apply NormalHedge to tracking,
we derive a Sequential Monte Carlo approximation to NormalHedge, and we show that this
approximation is efficient.

To demonstrate the robustness of our tracking algorithm, we perform simulations on a simple
one-dimensional tracking problem. We evaluate tracking performance by measuring the average distance
between the states estimated by the algorithms and the true hidden states. We instantiate our
algorithm with a simple clipping loss function. Our simulations show that our algorithm consistently
outperforms the Bayesian algorithm, under high measurement noise, and a wide range of levels of 
model mismatch. 

We note that the Bayesian algorithm can also be interpreted in the explanatory framework. In particular,
if the loss of a path is the negative log-likelihood (the log-loss) under some measurement model, 
then, the Bayesian algorithm can be shown to produce a path with log-loss close to that of the path 
with the minimum log-loss. One may be tempted to think that our tracking solution follows the same 
approach; however, the point of our paper is that one can use loss functions that are different from 
log-loss, and in particular, we show a scenario in which using other loss functions produces better 
tracking performance than the Bayesian algorithm (or its approximations). 

The rest of the paper is organized as follows. In Section 2, we explain in detail our explanatory model 
for tracking. In Section 3, we present NormalHedge, on which our tracking algorithm is based. In 
Section 4, we provide our tracking algorithm. Section 5 presents the experimental comparison of 
our algorithm with the Bayesian algorithm. Finally, we discuss related work in Section 6. 

The detailed bounds and proofs for NormalHedge are provided in the supplementary material. We 
feel that the algorithm NormalHedge may be of more general interest, and hence these details for 
NormalHedge have been submitted to NIPS in a companion paper. 

2 The explanatory framework for tracking 

In this section, we describe in more detail the setup of the tracking problem, and the explanatory
framework for tracking. In tracking, at each time $t$, we are given as input measurements (or
observations) $M(\cdot, t)$, and the goal is to estimate the hidden state of an object using these
measurements and our prior knowledge about the state dynamics.

In the explanatory framework, we are given a set $\mathcal{P}$ of paths (sequences) over the state space
$\mathcal{X} \subset \mathbb{R}^n$. At each time $t$, we assign to each path in $\mathcal{P}$ a loss function $\ell$. The loss function
has two parts: a dynamics loss $\ell_d$ and an observation loss $\ell_o$.

The dynamics loss $\ell_d$ captures our knowledge about the state dynamics. For simplicity, we use a
dynamics loss $\ell_d$ that can be written as

$\ell_d(p) = \sum_t \ell_d(x_t, x_{t-1})$

for a path $p = (x_1, x_2, \ldots)$. In other words, the dynamics loss at time $t$ depends only on the states
at time $t$ and $t - 1$. A common way to express our knowledge about the dynamics is in terms of a
dynamics function $F$, defined so that paths with $x_t \approx F(x_{t-1})$ will have small dynamics loss.

For example, consider an object moving with a constant velocity. Here, if the state $x_t = (p, v)$,
where $p$ is the position and $v$ is the velocity, then we would be interested in paths in which
$x_t \approx x_{t-1} + (v, 0)$. In these cases, the dynamics loss $\ell_d(x_t, x_{t-1})$ is typically a growing function of the
distance from $x_t$ to $F(x_{t-1})$.
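
To make the constant-velocity example concrete, here is a small sketch of a dynamics function and a per-step dynamics loss. The squared Euclidean distance used below is only one possible growing function of the distance and is an assumption of this sketch; the function names are hypothetical.

```python
import numpy as np

def F(x_prev):
    """Constant-velocity dynamics: position advances by the velocity, velocity is kept."""
    p, v = x_prev
    return np.array([p + v, v])

def dynamics_loss(x_t, x_prev):
    """Per-step dynamics loss: squared distance from x_t to F(x_prev) (assumed form)."""
    return float(np.sum((np.asarray(x_t) - F(x_prev)) ** 2))

def path_dynamics_loss(path):
    """Dynamics loss of a whole path: the sum of the per-step losses."""
    return sum(dynamics_loss(path[t], path[t - 1]) for t in range(1, len(path)))

# A path that follows the dynamics exactly, and one with a jump in position.
smooth = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 1.0])]
jumpy = [np.array([0.0, 1.0]), np.array([5.0, 1.0]), np.array([6.0, 1.0])]

smooth_loss = path_dynamics_loss(smooth)  # 0.0, every step matches F
jumpy_loss = path_dynamics_loss(jumpy)    # 16.0, from the single jump of size 4
```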

The second component of the loss function is an observation loss $\ell_o$. Given a path $p = (x_1, x_2, \ldots)$,
and measurements $M = (M(\cdot, 1), M(\cdot, 2), \ldots)$, the observation loss function $\ell_o(p, M)$ quantifies
how well the path $p$ explains the measurements. Again, for simplicity, we restrict ourselves to loss
functions $\ell_o$ that can be written as:

$\ell_o(p, M) = \sum_t \ell_o(x_t, M(\cdot, t))$.

In other words, the observation loss of a path at time t depends only on its state at time t and the 
measurements at time t. The total loss of a path p is the sum of its dynamics and observation losses. 
We note that the loss of a path depends only on that particular path and the measurements, and not 
on the true hidden state. As a result, the loss of a path can always be evaluated by the algorithm at 
any given time. 

The algorithmic framework we consider in this model is analogous to, and motivated by, the
decision-theoretic framework for online learning [6, 5]. At time $t$, the algorithm assigns a weight to
each path $p$ in $\mathcal{P}$. The estimated state at time $t$ is the weighted mean of the states, where the
weight of a state is the total weight of all paths in this state. The loss of the algorithm at time $t$
is the weighted loss of all paths in $\mathcal{P}$. The theoretical guarantee we look for is that the loss of the
algorithm is close to the loss of the best path in $\mathcal{P}$ in hindsight (or, close to the loss of the top
$\epsilon$-quantile path in $\mathcal{P}$ in hindsight). Thus, if $\mathcal{P}$ has a small fraction of paths with low loss, and if the
loss functions successfully capture the tracking performance, then the sequence of states estimated
by the algorithm will have good tracking performance.

3 NormalHedge 

In this section, we describe the NormalHedge algorithm. To present NormalHedge in full generality, 
we first need to describe the decision-theoretic framework for online learning. 

The problem of decision-theoretic online learning is as follows. At each round, a learner has access
to a set of $N$ actions; for our purposes, an action is any method that provides a prediction in each
round. The learner maintains a distribution $w_{i,t}$ over the actions at time $t$. At each time period $t$, each
action $i$ suffers a loss $\ell_{i,t}$ which lies in a bounded range, and the loss of the learner is $\sum_i w_{i,t} \ell_{i,t}$.
We note that this framework is very general: no assumption is made about the nature of the
actions and the distribution of the losses. The goal of the learner is to maintain a distribution over
the actions, such that its cumulative loss over time is low, compared to the cumulative loss of the
action with the lowest cumulative loss. In some cases, particularly when the number of actions is
very large, we are interested in achieving a low cumulative loss compared to the top $\epsilon$-quantile of
actions. Here, for any $\epsilon$, the top $\epsilon$-quantile of actions are the $\epsilon$ fraction of actions which have the
lowest cumulative loss.

Starting with the seminal work of Littlestone and Warmuth [7], the problem of decision-theoretic
online learning has been well-studied in the literature [6, 9, 5]. The most common algorithm for this
problem is Hedge or Exponential Weights [6], which assigns to each action a weight exponentially
small in its total loss. In this paper, however, we consider a different algorithm, NormalHedge, for
this problem [8], and it is this algorithm that forms the basis of our tracking algorithm. While the
Bayesian averaging algorithm can be shown to be a variant of Hedge when the loss function is the
log-loss, such is not the case for NormalHedge, and it is a very different algorithm. A significant
advantage of using NormalHedge is that it has no parameters to tune, yet achieves performance
comparable to the best performance of previous online learning algorithms with optimally tuned
parameters.
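
The connection between Hedge and Bayesian averaging can be checked numerically. The sketch below assumes the standard exponential-weights form with a learning rate `eta` (a parameter name introduced here), and uses made-up likelihood values for two candidate models; with log-loss, `eta = 1` and a uniform prior, the Hedge weights coincide with the Bayesian posterior over the models.

```python
import numpy as np

def hedge_weights(cumulative_losses, eta):
    """Exponential weights: w_i proportional to exp(-eta * cumulative loss of action i)."""
    z = -eta * np.asarray(cumulative_losses, dtype=float)
    z -= z.max()                      # shift for numerical stability; does not change the ratios
    w = np.exp(z)
    return w / w.sum()

# Hypothetical likelihoods that two models assign to three observations.
lik = np.array([[0.5, 0.4, 0.9],
                [0.2, 0.7, 0.1]])

log_loss = -np.log(lik).sum(axis=1)   # cumulative log-loss of each model
hedge = hedge_weights(log_loss, eta=1.0)

# Bayesian posterior over the two models under a uniform prior.
joint = lik.prod(axis=1)
bayes = joint / joint.sum()
```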

In the NormalHedge algorithm, for each action $i$ and time $t$, we use $w_{i,t}$ to denote the NormalHedge
weight assigned to action $i$ at time $t$. At any time $t$, we define the regret $R_{i,t}$ of our algorithm to
an action $i$ as the difference between the cumulative loss of our algorithm and the cumulative loss
of this action. Also, for any real number $x$, we use the notation $[x]_+$ to denote $\max(0, x)$. The
NormalHedge algorithm is presented below.



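Algorithm 1 NormalHedge
initialize $R_{i,0} = 0$, $w_{i,1} = 1/N$ $\forall i$
1: for $t = 1, 2, \ldots$ do
2:   Each action $i$ incurs loss $\ell_{i,t}$.
3:   Learner incurs loss $\ell_{A,t} = \sum_{i=1}^{N} w_{i,t} \ell_{i,t}$.
4:   Update cumulative regrets: $R_{i,t} = R_{i,t-1} + (\ell_{A,t} - \ell_{i,t})$ $\forall i$.
5:   Find $c_t > 0$ satisfying $\frac{1}{N} \sum_{i=1}^{N} \exp\left( \frac{([R_{i,t}]_+)^2}{2 c_t} \right) = e$.
6:   Update distribution: $w_{i,t+1} \propto \frac{[R_{i,t}]_+}{c_t} \exp\left( \frac{([R_{i,t}]_+)^2}{2 c_t} \right)$ $\forall i$.
7: end for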



The performance guarantees for the NormalHedge algorithm, as shown by [8], can be stated as
follows.

Theorem 1. If NormalHedge has access to $N$ actions, then for all loss sequences, for all $t$, for all
$0 < \epsilon < 1$, the regret of the algorithm to the top $\epsilon$-quantile of the actions is $O(\sqrt{t \ln(1/\epsilon)} + \ln^2 N)$.



Note that the actions which have total loss greater than the total loss of the algorithm, are assigned 
zero weight. Since the algorithm performs almost as well as the best action, in a scenario where 
a few actions are significantly better than the rest, the algorithm will assign zero weight to most 
actions. In other words, the support of the NormalHedge weights may be a very small set, which 
can significantly reduce its computational cost. 
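
The following is a minimal sketch of NormalHedge as described in this section, written for illustration rather than efficiency. Solving for the scale $c_t$ by bisection, the bracketing interval, and the fallback to uniform weights when no action has positive regret are implementation choices assumed here; the random loss matrix is synthetic.

```python
import math
import numpy as np

def solve_scale(pos_regrets):
    """Find c > 0 with mean(exp(r^2 / (2c))) = e, for r = [R]_+ with at least one r > 0."""
    r2 = pos_regrets ** 2
    n = len(pos_regrets)
    hi = r2.max() / 2.0                           # every term is at most e, so the mean is <= e
    lo = r2.max() / (2.0 * (1.0 + math.log(n)))   # the largest term alone is e * n, so the mean is >= e
    for _ in range(100):                          # the mean is decreasing in c, so bisect
        mid = 0.5 * (lo + hi)
        if np.mean(np.exp(r2 / (2.0 * mid))) > math.e:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def normalhedge_weights(regrets):
    """Weights proportional to ([R]_+ / c) * exp([R]_+^2 / (2c)); zero for non-positive regret."""
    pos = np.maximum(np.asarray(regrets, dtype=float), 0.0)
    if (pos ** 2).max() <= 0.0:                   # assumed fallback: no action beats the learner
        return np.full(len(pos), 1.0 / len(pos))
    c = solve_scale(pos)
    w = (pos / c) * np.exp(pos ** 2 / (2.0 * c))
    return w / w.sum()

def run_normalhedge(losses):
    """losses has shape (T, N) with bounded entries; returns learner losses, regrets, weights."""
    T, N = losses.shape
    regrets = np.zeros(N)
    w = np.full(N, 1.0 / N)
    learner = np.zeros(T)
    for t in range(T):
        learner[t] = w @ losses[t]                # loss of the learner in round t
        regrets += learner[t] - losses[t]         # cumulative regret to each action
        w = normalhedge_weights(regrets)          # distribution for the next round
    return learner, regrets, w

# Synthetic example: ten actions with random losses in [0, 1]; action 3 is made much better.
rng = np.random.default_rng(0)
losses = rng.random((200, 10))
losses[:, 3] *= 0.2
learner_losses, final_regrets, final_weights = run_normalhedge(losses)
```

Note that the update has no learning rate: the only quantity computed per round is the scale $c_t$, which is determined by the regrets themselves.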



4 Tracking using NormalHedge 

To apply NormalHedge directly to tracking, we set each action to be a path in the state space,
and the loss of each action at time $t$ to be the loss of the corresponding path at time $t$. To make
NormalHedge more robust in a practical setting, we make a small change to the algorithm: instead
of using cumulative loss, we use a discounted cumulative loss. For a discount parameter $0 < \alpha < 1$,
the discounted cumulative loss of an action $i$ at time $T$ is $\sum_{t=1}^{T} (1 - \alpha)^{T-t} \ell_{i,t}$. Using discounted
losses is common in reinforcement learning [10]; intuitively, it makes the tracking algorithm more
flexible, and allows it to more easily recover from past mistakes.
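
The discounted cumulative loss need not be recomputed from scratch in every round: multiplying the running total by $(1 - \alpha)$ and adding the newest loss gives the same value. The short sketch below checks this equivalence on made-up loss values; the function names are hypothetical.

```python
def discounted_loss_direct(losses, alpha):
    """Sum over t = 1..T of (1 - alpha)^(T - t) * loss_t, computed from the definition."""
    T = len(losses)
    return sum((1.0 - alpha) ** (T - t) * losses[t - 1] for t in range(1, T + 1))

def discounted_loss_recursive(losses, alpha):
    """Same quantity via the recursion L_T = (1 - alpha) * L_{T-1} + loss_T."""
    total = 0.0
    for loss in losses:
        total = (1.0 - alpha) * total + loss
    return total

example_losses = [0.3, 0.9, 0.1, 0.5]
direct = discounted_loss_direct(example_losses, alpha=0.02)
recursive = discounted_loss_recursive(example_losses, alpha=0.02)
```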

However, a direct application of NormalHedge is prohibitively expensive in terms of computation
cost. Therefore, in the sequel, we show how to derive a Sequential Monte-Carlo based approximation
to NormalHedge, and we use this approximation in our experiments.

The key observation behind our approximation is that the weights on actions generated by the
NormalHedge algorithm induce a distribution over the states at each time $t$. We therefore use a random
sample of states in each round to approximate this distribution. Thus, just as particle filters
approximate the posterior density on the states induced by the Bayesian algorithm, our tracking algorithm
approximates the density induced on the states by NormalHedge for tracking.

The main difference between NormalHedge and our tracking algorithm is that while NormalHedge
always maintains the weights for all the actions, we delete an action from our action list when its
weight falls to 0. We then replace this action by our resampling procedure, which chooses another
action which is currently in a region of the state space where the actions have low losses. Thus, we
do not spend resources maintaining and updating weights for actions which do not perform well.
Another difference between NormalHedge and our tracking algorithm is that in our approximation,
we do not explicitly impose a dynamics loss on the actions. Instead, we use a resampling procedure
that only considers actions with low dynamics loss. This also avoids spending resources on actions
which have high dynamics loss anyway.

Algorithm 2 Tracking algorithm
input $N$ (number of actions), $\alpha$ (discount factor), $\Sigma^*$ (resampling parameter), $F$ (dynamics function)
1: $A := \{x_{1,1}, \ldots, x_{N,1}\}$ with $x_{i,1}$ randomly drawn from $\mathcal{X}$; $R_{i,0} := 0$; $w_{i,0} := 1/N$ $\forall i$
2: for $t = 1, 2, \ldots$ do
3:   Obtain losses $\ell_{i,t} = \ell_o(x_{i,t})$ for each action $i$ and update regrets:
     $R_{i,t} := (1 - \alpha) R_{i,t-1} + (\ell_{A,t} - \ell_{i,t})$ where $\ell_{A,t} = \sum_{i=1}^{N} w_{i,t-1} \ell_{i,t}$.
4:   Delete poor actions: let $X = \{i : R_{i,t} \le 0\}$, set $A := A \setminus X$.
5:   Resample actions: $A := A \cup \mathrm{Resample}(X, \Sigma^*, t)$.
6:   Compute weight of each action $i$: $w_{i,t} \propto \frac{[R_{i,t}]_+}{c} \exp\left( \frac{([R_{i,t}]_+)^2}{2c} \right)$
     where $c$ is the solution to the equation $\frac{1}{N} \sum_{i=1}^{N} \exp\left( \frac{([R_{i,t}]_+)^2}{2c} \right) = e$.
7:   Estimate: $x_{A,t} := \sum_{i=1}^{N} w_{i,t} x_{i,t}$.
8:   Update states: $x_{i,t+1} := F(x_{i,t})$ $\forall i$.
9: end for

Algorithm 3 Resampling algorithm
input $X$ (actions to be resampled), $\Sigma^*$ (resampling parameter), $t$ (current time)
1: for $j \in X$ do
2:   Set $X_+ := \{i : R_{i,t} > 0\}$.
3:   If $X_+ = \emptyset$: set $p_i := 1/N$ $\forall i$. Else: set $p_i \propto w_{i,t-1}$ $\forall i \in X_+$ and $p_i = 0$ $\forall i \notin X_+$.
4:   Draw $i \sim (p_1, \ldots, p_N)$.
5:   Draw $x_{j,t} \sim \mathcal{N}(x_{i,t}, \Sigma^*)$, and set $R_{j,t} := (1 - \alpha) R_{i,t-1} + (\ell_{A,t} - \ell_o(x_{j,t}))$.
6: end for

Figure 1: NormalHedge-based tracking algorithm.

Our tracking algorithm is specified in Algorithm 2. Each action $i$ in our algorithm is a path
$(x_{i,1}, x_{i,2}, \ldots)$ in the state space $\mathcal{X} \subset \mathbb{R}^n$. However, we do not maintain this entire path
explicitly for each action; rather, Step 8 of the algorithm computes $x_{i,t+1}$ from $x_{i,t}$ using the dynamics
function $F$, so we only need to maintain the current state of each action. Recall that applying the
dynamics function $F$ should ensure that the path incurs no or little dynamics loss (see Section 2).

We start with a set of actions $A$ initially positioned at states uniformly distributed over $\mathcal{X}$, and a
uniform weighting over these actions. In each round, like NormalHedge, each action incurs a loss
determined by its current state, and the tracker incurs the expected loss determined by the current
weighting over actions. Using these losses, we update the cumulative (discounted) regrets to each
action. However, unlike NormalHedge, we then delete all actions with zero or negative regret, and
replace them using a resampling procedure. This procedure replaces poorly performing actions
with actions currently at high density regions of $\mathcal{X}$, thereby providing a better approximation to the
intended weights.

The resampling procedure is explained in detail in Algorithm 3. The main idea is to sample from
the regions of the state space with high weight. This is done by sampling an action proportional
to its weight in the previous round. We then choose a state randomly (roughly) from an ellipsoid
$\{x : (x - x_t)^\top (\Sigma^*)^{-1} (x - x_t) \le 1\}$ around the current state $x_t$ of the selected action; the new action
inherits the history of the selected action, but has a current state which is different from (but close
to) the selected action. This latter step makes the new state distribution smoother than the one in the
previous round, which may be supported on just a few states if only a few actions have low losses.
We note that $\Sigma^*$ can be set so that the resampling procedure only samples actions with low dynamics
loss (and Step 8 of the algorithm ensures that the remaining actions in the set $A$ do not incur any
dynamics loss); thus, our algorithm does not explicitly compute a dynamics loss for each action.

Figure 2: Plots of the measurements (as a function of $x$) for $p = 0$ and $\sigma_o = 1$, and the density of
the noise $n_t(x)$ (blue: $p = 0$; red: $p = 0.2$).
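
One round of the tracking and resampling procedures (Algorithms 2 and 3) can be sketched as follows for scalar states. This is an illustrative reading of the pseudocode, not the implementation used in the experiments: the bisection solver, the uniform fallbacks, the toy loss `obs_loss` and the target position are all assumptions of this sketch, and `sigma_star` is treated as a variance.

```python
import math
import numpy as np

def normalhedge_weights(regrets):
    """NormalHedge weights for the given regrets (scale found by bisection, an assumed choice)."""
    pos = np.maximum(regrets, 0.0)
    r2, n = pos ** 2, len(pos)
    if r2.max() <= 0.0:
        return np.full(n, 1.0 / n)                # assumed fallback: uniform weights
    lo, hi = r2.max() / (2.0 * (1.0 + math.log(n))), r2.max() / 2.0
    for _ in range(100):
        c = 0.5 * (lo + hi)
        if np.mean(np.exp(r2 / (2.0 * c))) > math.e:
            lo = c
        else:
            hi = c
    c = 0.5 * (lo + hi)
    w = (pos / c) * np.exp(r2 / (2.0 * c))
    return w / w.sum()

def tracker_step(states, prev_regrets, prev_weights, obs_loss, F, alpha, sigma_star, rng):
    """One round of the tracker; returns (estimate, next_states, regrets, weights)."""
    n = len(states)
    losses = np.array([obs_loss(x) for x in states])
    loss_alg = float(prev_weights @ losses)                       # Step 3: loss of the tracker
    regrets = (1.0 - alpha) * prev_regrets + (loss_alg - losses)  # Step 3: discounted regrets
    poor = np.flatnonzero(regrets <= 0.0)                         # Step 4: actions to delete
    good = np.flatnonzero(regrets > 0.0)
    probs = np.zeros(n)
    if len(good) == 0:
        probs[:] = 1.0 / n                                        # Algorithm 3, Step 3 (empty case)
    elif prev_weights[good].sum() > 0.0:
        probs[good] = prev_weights[good] / prev_weights[good].sum()
    else:
        probs[good] = 1.0 / len(good)                             # assumed fallback: all weights zero
    new_states = states.copy()
    for j in poor:                                                # Step 5: resample near good actions
        i = rng.choice(n, p=probs)
        new_states[j] = rng.normal(states[i], math.sqrt(sigma_star))
        regrets[j] = (1.0 - alpha) * prev_regrets[i] + (loss_alg - obs_loss(new_states[j]))
    weights = normalhedge_weights(regrets)                        # Step 6
    estimate = float(weights @ new_states)                        # Step 7: weighted mean state
    next_states = np.array([F(x) for x in new_states])            # Step 8: apply the dynamics
    return estimate, next_states, regrets, weights

# Toy run: a stationary target at 10 and a bounded loss that is flat beyond distance 50.
rng = np.random.default_rng(0)
target = 10.0
obs_loss = lambda x: min(abs(x - target), 50.0) / 50.0
states = np.linspace(-500.0, 500.0, 101)
regrets, weights = np.zeros(101), np.full(101, 1.0 / 101)
est1, states, regrets, weights = tracker_step(
    states, regrets, weights, obs_loss, lambda x: x, 0.02, 400.0, rng)
weights1 = weights.copy()
for _ in range(20):
    est, states, regrets, weights = tracker_step(
        states, regrets, weights, obs_loss, lambda x: x, 0.02, 400.0, rng)
```

In the first round, only actions whose loss is below the tracker's own loss receive positive weight, so the estimate is a convex combination of states within distance 50 of the target.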

5 Simulations 

For our simulations, we consider the task of tracking an object in a simple, one-dimensional state 
space. To evaluate our algorithm, we measure the distance between the estimated states, and the true 
states of the object. 

Our experimental setup is inspired by the application of tracking faces in videos, using a standard 
face detector [11]. In this case, the state is the location of a face, and each measurement corresponds
to a score output by the face detector for a region in the current video frame. The goal is to predict
the location of the face across several video frames, using these scores produced by the detector.
The detector typically returns high scores for several regions around the true location of a face, but it 
may also erroneously produce high scores elsewhere. And though in some cases the detection score 
may have a probabilistic interpretation, it is often difficult to accurately characterize the distribution 
of the noise. 

The precise setup of our simulations is as follows. The object to be tracked remains stationary
or moves with velocity at most 1 in the interval $[-500, 500]$. At time $t$, the
true state is the position $z_t$, and the measurements correspond to a 1001-dimensional vector
$M(t) = [M(-500, t), M(-499, t), \ldots, M(499, t), M(500, t)]$ for locations in a grid
$G = \{-500, -499, \ldots, 499, 500\}$, generated by an additive noise process

$M(x, t) = H(x, z_t) + n_t(x)$.

Here, $H(x, z_t)$ is the square pulse function of width $2W$ around the true state $z_t$: $H(x, z_t) = 1$ if
$|x - z_t| \le W$ and $0$ otherwise (see Figure 2, left). The additive noise $n_t(x)$ is randomly generated
independently for each $t$ and each $x \in G$, using the mixture distribution



signal, and the parameter $p$ represents the fraction of outliers. In our experiments, we fix $W = 50$
and vary $\sigma_o$ and $p$. The total number of time steps we track for is $T = 200$.

In the generative framework, the dynamics of the object is represented by the transition model
$x_{t+1} \sim \mathcal{N}(x_t, \sigma_d^2)$, and the observations are represented by the measurement process
$M(x, t) \sim \mathcal{N}(H(x, x_t), \sigma_o^2)$. Thus, when $p = 0$, the observations are generated according to the measurement
process supplied to the generative framework; for $p > 0$, a $p$ fraction of the observations are outliers.
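
For the matched case $p = 0$, the measurement vector is a square pulse plus independent Gaussian noise at every grid location, and can be simulated as in the sketch below. Only this case is sketched; the outlier component of the noise is deliberately not reproduced here. The function names, the closed pulse boundary and the random seed are assumptions of the example.

```python
import numpy as np

GRID = np.arange(-500, 501)   # the grid G = {-500, ..., 500}, 1001 locations

def pulse(grid, z, W):
    """Square pulse H(x, z): 1 where |x - z| <= W and 0 elsewhere."""
    return (np.abs(grid - z) <= W).astype(float)

def measure(z, W, sigma_o, rng):
    """M(x, t) = H(x, z_t) + n_t(x) with purely Gaussian noise (the p = 0 case only)."""
    return pulse(GRID, z, W) + rng.normal(0.0, sigma_o, size=GRID.size)

rng = np.random.default_rng(0)
m = measure(z=0.0, W=50, sigma_o=1.0, rng=rng)   # one measurement vector at low noise
```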

Table 1: Experimental Results. Root-mean-squared-errors of the predicted positions over $T = 200$
time steps for our algorithm (NH), the Bayesian algorithm, and the particle filter (PF). The reported
values are the averages and standard deviations over 100 simulations.

Low Noise ($\sigma_o = 1$)
p       NH              Bayes           PF
0.00    3.18 ± 0.33     1.17 ± 0.09     1.23 ± 0.11
0.01    3.21 ± 0.34     1.90 ± 0.25     3.98 ± 1.06
0.05    3.26 ± 0.34     3.99 ± 0.52     81.70 ± 1.74
0.10    3.31 ± 0.35     6.40 ± 0.84     81.70 ± 1.74
0.15    3.42 ± 0.34     8.38 ± 1.10     81.70 ± 1.74
0.20    3.52 ± 0.41     10.28 ± 1.24    81.70 ± 1.74

High Noise ($\sigma_o = 8$)
p       NH              Bayes           PF
0.00    10.93 ± 2.52    10.98 ± 2.33    14.35 ± 5.16
0.01    11.26 ± 3.43    12.76 ± 3.07    44.29 ± 16.7
0.05    12.03 ± 3.47    19.75 ± 6.70    81.70 ± 1.74
0.10    12.25 ± 2.93    27.33 ± 10.9    81.70 ± 1.74
0.15    13.38 ± 3.07    32.78 ± 13.1    81.70 ± 1.74
0.20    14.15 ± 3.88    43.99 ± 26.8    81.70 ± 1.74

For the explanatory framework, the expected state dynamics function $F$ is the identity function, and
the observation loss of a path $p = (x_1, x_2, \ldots)$ at time $t$ is given by

$\ell_o(x_t, M(\cdot, t)) = -\sum_{x \in [x_t - W, x_t + W] \cap G} q(M(x, t))$,

where $q(y) = \min(1 + \sigma_o, \max(y, -\sigma_o))$ clips the measurements to the range $[-\sigma_o, 1 + \sigma_o]$. That is,
the observation loss for $x_t$ with respect to $M(\cdot, t)$ is the negative sum of thresholded measurement
values $q(M(x, t))$ for $x$ in an interval of width $2W$ around $x_t$.
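
The clipped observation loss can be written in a few lines. The sketch below evaluates it on a noise-free pulse with a single large outlier, a made-up measurement vector used only to show the effect of clipping; the function names are hypothetical.

```python
import numpy as np

GRID = np.arange(-500, 501)   # the grid G = {-500, ..., 500}

def clip_q(y, sigma_o):
    """q(y) = min(1 + sigma_o, max(y, -sigma_o)): clip to the range [-sigma_o, 1 + sigma_o]."""
    return np.minimum(1.0 + sigma_o, np.maximum(y, -sigma_o))

def observation_loss(x_t, measurements, W, sigma_o):
    """Negative sum of clipped measurements over grid points within W of x_t."""
    window = np.abs(GRID - x_t) <= W
    return -float(np.sum(clip_q(measurements[window], sigma_o)))

# Noise-free pulse of half-width 50 around 0, plus one outlier of size 1000 at x = 300.
m = (np.abs(GRID) <= 50).astype(float)
m[GRID == 300] = 1000.0

loss_at_truth = observation_loss(0, m, W=50, sigma_o=1.0)      # -101.0: the full pulse
loss_at_outlier = observation_loss(300, m, W=50, sigma_o=1.0)  # -2.0: outlier clipped to 1 + sigma_o
```

Without clipping, the window around the outlier would sum to 1000 and would look far better than the window around the true state; with clipping, the true state has the lower loss.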

Given only the observation vectors $M$, we use three different methods to estimate the true underlying
state sequence $(z_1, z_2, \ldots)$. The first is the Bayesian algorithm, which recursively applies Bayes'
rule to update a posterior distribution using the transition and observation model. The posterior
distribution is maintained at each location in the discretization $G$. For the Bayesian algorithm, we
set $\sigma_o$ to the actual value of $\sigma_o$ used to generate the observations, and we set $\sigma_d = 2$. The value
of $\sigma_d$ was obtained by tuning on measurement vectors generated with the same true state sequence,
but with independently generated noise values. The prior distribution over states assigns probability
one to the true value of $z_1$ (which is 0 in our setup) and zero elsewhere. The second algorithm is
our algorithm (NH) described in Section 4. For our algorithm, we use the parameters $\Sigma^* = 400$
and $\alpha = 0.02$. These parameters were also obtained by tuning over a range of values for $\Sigma^*$ and
$\alpha$. We also compare our algorithm with the particle filter (PF), which uses the same parameters as
the Bayesian algorithm, and predicts using the expected state under the (approximate) posterior
distribution. For our algorithm, we use $N = 100$ actions, and for the particle filter, we use $N = 100$
particles. For our experiments, we use an implementation of the particle filter due to [12].

Figure 3 shows the true state and the states predicted by our algorithm (blue) and the Bayesian
algorithm (red) for two different values of $\sigma_o$ for 5 independent simulations. Table 1 summarizes
the performance of these algorithms for different values of the parameter $p$, for two different values
of the noise parameter $\sigma_o$. We report the average and standard deviation of the RMSE
(root-mean-squared-error) between the true state and the predicted state. The RMSE is computed over the
$T = 200$ state predictions for a single simulation, and these RMSE values are averaged over 100
independent simulations.
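
The evaluation metric can be sketched as follows. The per-simulation RMSE follows the description above; the use of the population standard deviation (NumPy's default) when summarizing across simulations is an assumption, since the exact convention is not stated.

```python
import numpy as np

def rmse(true_states, predicted_states):
    """Root-mean-squared error between true and predicted states of one simulation."""
    diff = np.asarray(true_states, dtype=float) - np.asarray(predicted_states, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def summarize(rmse_values):
    """Average and standard deviation of per-simulation RMSE values."""
    values = np.asarray(rmse_values, dtype=float)
    return float(values.mean()), float(values.std())

# Tiny hand-checkable example with two "simulations".
run_a = rmse([0.0, 0.0], [3.0, -3.0])   # every error has magnitude 3, so the RMSE is 3
run_b = rmse([1.0, 2.0], [1.0, 2.0])    # perfect prediction, RMSE 0
mean_rmse, std_rmse = summarize([run_a, run_b])
```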

Our experiments show that the Bayesian algorithm performs well when $p = 0$, that is, when it is supplied
with the correct noise model; however, its performance degrades rapidly as $p$ increases, and becomes
very poor even at $p = 0.2$. On the other hand, the performance of our algorithm does not suffer
appreciably when $p$ increases. The degradation of performance of the Bayesian algorithm is even
more pronounced when the noise is high with respect to the signal ($\sigma_o = 8$). The particle filter
suffers an even higher degradation in performance, and has poor performance even when $p = 0.01$
(that is, when 99% of the observations are generated from the correct likelihood distribution supplied
to the particle filter). Our results indicate that the Bayesian algorithm is very sensitive to model
mismatches. On the other hand, our algorithm, when equipped with a clipped-loss function, is
extremely robust to model mismatches. In particular, our algorithm provides an RMSE value of 19.6
even under high noise ($\sigma_o = 8$), when $p$ is as high as 0.4.

Some additional experiments with our algorithm are included in the supplementary appendix; they
illustrate how the performance of our algorithm varies with the parameters $\Sigma^*$ and $\alpha$, and tabulate
the performance of our algorithm for higher values of $p$.

Figure 3: Predicted paths in five simulations. First column: low noise ($\sigma_o = 1$, $p = 0.2$). Second column:
high noise ($\sigma_o = 8$, $p = 0.2$). The blue lines correspond to our algorithm, the red lines correspond to the
Bayesian algorithm, and the dashed black line represents the true states.

6 Related work 

The generative approach to tracking has roots in control and estimation theory, starting with the
seminal work of Kalman [1]. The most popular generative method used in tracking is the particle
filter [2], and its numerous variants. The literature here is vast, and there have been many exciting
developments in recent years (e.g. [3, 13]); we refer the reader to [14] for a detailed survey of the
results.

The suboptimality of the Bayesian algorithm under model mismatch has been investigated in other
contexts such as classification [15, 16]. The view of the Bayesian algorithm as an online learning
algorithm for log-loss is well-known in various communities, including information theory /
MDL [17, 18] and computational learning theory [19, 20]. In our work, we look beyond the
Bayesian algorithm and log-loss to consider other loss functions and algorithms that are more
appropriate for our task.

There has also been some work on tracking in the online learning literature (see, for example,
[21, 22]); there, however, the authors study a very different model for tracking.

7 Conclusions 

In this paper, we introduce an explanatory framework for tracking based on online learning, which 
broadens the space for designing algorithms that need not conform to the standard Bayesian
approach to tracking. We propose a new algorithm for tracking in this framework that deviates
significantly from the Bayesian approach. Experimental results show that our algorithm significantly
outperforms the Bayesian algorithm, even when the observations are generated by a distribution 
deviating just slightly from the model supplied to the Bayesian algorithm. Our work reveals an 
interesting connection between decision theoretic online learning and Bayesian filtering. 